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Abstract 

We consider the problem of simultaneously learning to linearly combine a very large 
number of kernels and learn a good predictor based on the learnt kernel. When the 
^ number of kernels d to be combined is very large, multiple kernel learning methods whose 

computational cost scales linearly in d are intractable. We propose a randomized version 
of the mirror descent algorithm to overcome this issue, under the objective of minimizing 
the group p-norm penalized empirical risk. The key to achieve the required exponential 
speed-up is the computationally efficient construction of low-variance estimates of the 
gradient. We propose importance sampling based estimates, and find that the ideal 
distribution samples a coordinate with a probability proportional to the magnitude of 
the corresponding gradient. We show the surprising result that in the case of learning 
^* the coefficients of a polynomial kernel, the combinatorial structure of the base kernels 

. to be combined allows the implementation of sampling from this distribution to run 

in 0(log(d)) time, making the total computational cost of the method to achieve an e- 
optimal solution to be 0(log(d) /e 2 ), thereby allowing our method to operate for very large 
values of d. Experiments with simulated and real data confirm that the new algorithm is 
computationally more efficient than its state-of-the-art alternatives. 
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1 Introduction 

We look into the computational challenge of finding a good predictor in a multiple kernel 
learning (MKL) setting where the number of kernels is very large. In particular, we are 
interested in cases where the base kernels come from a space with combinatorial structure 
and thus their number d could be exponentially large. Just like some previous works (e.g. 
Rakotomamonjy et al., 2008; Xu et al., 2008; Nath et al., 2009) we start with the approach 
that views the MKL problem as a nested, large scale convex optimization problem, where 
the first layer optimizes the weights of the kernels to be combined. More specifically, as the 
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objective we minimize the group p-norm penalized empirical risk. However, as opposed to 
these works whose underlying iterative methods have a complexity of Cl(d) for just any one 
iteration, following (Nesterov, 2010, 2012; Shalev-Shwartz and Tewari, 2011; Richtarik and 
Takac, 2011) we use a randomized coordinate descent method, which was effectively used in 
these works to decrease the per iteration complexity to O(l). The role of randomization in our 
method is to use it to build an unbiased estimate of the gradient at the most recent iteration. 
The issue then is how the variance (and so the number of iterations required) scales with d. 
As opposed to the above mentioned works, in this paper we propose to make the distribution 
over the updated coordinate dependent on the history. We will argue that sampling from a 
distribution that is proportional to the magnitude of the gradient vector is desirable to keep 
the variance (actually, second moment) low and in fact we will show that there are interesting 
cases of MKL (in particular, the case of combining kernels coming from a polynomial family 
of kernels) when efficient sampling (i.e., sampling at a cost of O '(log d)) is feasible from this 
distribution. Then, the variance is controlled by the a priori weights put on the kernels, 
making it potentially independent of d. Under these favorable conditions ( and in particular, 
for the polynomial kernel set with some specific prior weights), the complexity of the method 
as a function of d becomes logarithmic, which makes our MKL algorithm feasible even for 
large scale problems. This is to be contrasted to the approach of Nesterov (2010, 2012) where 
a fixed distribution is used and where the a priori bounds on the method's convergence rate, 
and, hence, its computational cost to achieve a prescribed precision, will depend linearly on d 
(note that we are comparing upper bounds here, so the actual complexity could be smaller). 
Our algorithm is based on the mirror descent (or mirror descent) algorithm (similar to the 
work of Richtarik and Takac (2011) who uses uniform distributions). 

It is important to mention that there are algorithms designed to handle the case of 
infinitely many kernels, for example, the algorithms by Argyriou et al. (2005, 2006); Gehler 
and Nowozin (2008). However, these methods lack convergence rate guarantees, and, for 
example, the consistency for the method of Gehler and Nowozin (2008) works only for "small" 
d. The algorithm of Bach (2008), though practically very efficient, suffers from the same 
deficiency. A very interesting proposal by Cortes et al. (2009) considers learning to combine 
a large number of kernels and comes with guarantees, though their algorithm restricts the 
family of kernels in a specific way. 

The rest of the paper is organized as follows. The problem is defined formally in Section 2. 
Our new algorithm is presented and analyzed in Section 3, while its specialized version 
for learning polynomial kernels is given in Section 4. Finally, experiments are provided in 
Section 5. 

2 Preliminaries 

In this section we give the formal definition of our problem. Let I denote a finite index set, 
indexing the predictors (features) to be combined, and define the set of predictors considered 
over the input space X as F = {f u , : X — > R : f w {x) = X^ex ( w i> ^ii 30 )) > x e <^}- Here Wi 
is a Hilbert space over the reals, (j>i : X — )■ Wi is a feature-map, (x, y) is the inner product 
over the Hilbert space that x,y belong to and w = {wiji^x £ W = XjgjWj (as an example, 
Wj may just be a finite dimensional Euclidean space). The problem we consider is to solve 
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the optimization problem 

minimize L n (f w ) + Pen(f w ) subject to w G W , (1) 

where Pen(f w ) is a penalty that will be specified later, and L n (f w ) = ^ Y^t=i £t(fw(xt))is the 
empirical risk of predictor f w , defined in terms of the convex losses it : M — > M. (1 < t < n) 
and inputs xt G X (1 < t < n). The solution w* of the above penalized empirical risk 
minimization problem is known to have favorable generalization properties under various 
conditions, see, e.g., Hastie et al. (2009). In supervised learning problems £t(y) = £{yt,y) 
for some loss function I : K x M. — > M, such as the squared-loss, £(yt, y) = \{y — yt) 2 , or the 
hinge-loss, £t(yt,y) = max(l — yyt,0), where in the former case yt G R, while in the latter 
case yt G {— 1,+1}. We note in passing that for the sake of simplicity, we shall sometimes 
abuse notation and write L n (w) for L n (f w ) and even drop the index n when the sample-size 
is unimportant. 

As mentioned above, in this paper we consider the special case in (1) when the penalty 
is a so-called group p-norm penalty with 1 < p < 2, a case considered earlier, e.g., by Kloft 
et al. (2011). Thus our goal is to solve 

minimize L n (w) + \ (YV|K||?M , (2) 

where the scaling factors pi > 0,i £ I, are assumed to be given. We introduce the notation 
u = (ui) G IR" 1 to denote the column vector obtained from the values ui. 

The rationale of using the squared weighted p-norm is that for 1 < p < 2 it is expected to 
encourage sparsity at the group level which should allow one to handle cases when I is very 
large (and the case p = 2 comes for free from the same analysis). The actual form, however, 
is also chosen for reasons of computational convenience. In fact, the reason to use the 2-norm 
of the weights is to allow the algorithm to work even with infinite-dimensional feature vectors 
(and thus weights) by resorting to the kernel trick. To see how this works, just notice that 
the penalty in (2) can also be written as 

I « , <T? II nidi 3 

2-p I 

where for v > 1, A u = {9 G [0, 1]^ : \\0\\ u < 1} is the positive quadrant of the |X|-dimensional 
^-ball (see, e.g., Micchelli and Pontil, 2005, Lemma 26). Hence, defining 




J(w,d) = L(w) + lYl 



/»i Nil! 



2 -,T 

i Si 

for any w G W,9 G [0, l]'" 1 ', an equivalent form of (2) is 

minimize J(w,6) (3) 

where v = p/(2 — p) G [1, oo) and we define 0/0 = and u/0 = oo for u > 0, which implies 
that Wi = if 9i = 0. That this minimization problem is indeed equivalent to our original 
task (2) for the chosen value of v follows from the fact that J(w, 9) is jointly convex in (w, 6). 1 



1 Here and in what follows by equivalence we mean that the set of optimums in terms of w (the primary 
optimization variable) is the same in the two problems. 
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Let Ki : X x X — > M. be the reproducing kernel underlying cpf. tti(x,x') = (4>i(x), 4>i(x')) 
(x,x' G X) and let %j = H Ki the corresponding reproducing kernel Hilbert space (RKHS). 
Then, for any given fixed value of 0, the above problem becomes an instance of a standard 
penalized learning problem in the RKHS He underlying the kernel Kg = Yliex^iP7 K «- ^ n 
particular, by the theorem on page 353 in Aronszajn (1950), the problem of finding w € W 
for fixed can be seen to be equivalent to minimize f^Hg L(f) + and thus (2) is 

seen to be equivalent to minimizej e -^ 9i e e A„ L(f) + • Thus, we see that the method 

can be thought of as finding the weights of a kernel Kg and a predictor minimizing the T~Lg- 
norm penalized empirical risk. This shows that our problem is an instance of multiple kernel 
learning (for an exhaustive survey of MKL, see, e.g., Gonen and Alpaydm, 2011 and the 
references therein). 

3 The new approach 

When I is small, or moderate in size, the joint-convexity of J allows one to use off-the-shelf 
solvers to find the joint minimum of J. However, when X is large, off-the-shelf solvers might 
be slow or they may run out of memory. Targeting this situation we propose the following 
approach: Exploiting again that J(w,9) is jointly convex in (w,Q), find the optimal weights 
by finding the minimizer of 

J(0) = hdJ(w,e), 

w 

or, alternatively, J{9) = J(w*(9),9), where w*{9) = argmin^ J(w, 9) (here we have slightly 
abused notation by reusing the symbol J). Note that J (9) is convex by the joint convexity of 
J(w, 9). Also, note that w*(9) exists and is well-defined as the minimizer of «/(•, 9) is unique 
for any 9 G A u (see also Proposition 3.2 below). Again, exploiting the joint convexity of 
J(w,9), we find that if 9* is the minimizer of J (9), then w*(9*) will be an optimal solution 
to the original problem (2). To optimize J(9) we propose to use stochastic gradient descent 
with artificially injected randomness to avoid the need to fully evaluate the gradient of J. 
More precisely, our proposed algorithm is an instance of a randomized version of the mirror 
descent algorithm (Rockafellar, 1976; Martinet, 1978; Nemirovski and Yudin, 1998), where in 
each time step only one coordinate of the gradient is sampled. 

3.1 A randomized mirror descent algorithm 

Before giving the algorithm, we need a few definitions. Let d = A C M. d be nonempty 
with a convex interior A°. We call the function ^ : A — > R a Legendre (or barrier) potential 
if it is strictly convex, its partial derivatives exist and are continuous, and for every sequence 
{xfc} C A approaching the boundary of A, limfe_>.oo = oo. Here V is the gradient 

operator: V^(x) = (-^^(x)) T is the gradient of When V is applied to a non-smooth 
convex function J'(9) (J may be such without additional assumptions) then V J'{ff) is defined 
as any subgradient of J' at 9. The corresponding Bregman-divergence : A x A° — > R 
is defined as Dy(9,9') = tf(0) - *(6>') - (W(0'),0 - 9'). The Bregman projection H 9>K : 
A° —7- K corresponding to the Legendre potential \P and a closed convex set K C M. d such 
that K n A / is defined, for all 9 G A° as 11^(0) = arg mm 6 , e K nA D^(9' , 0). 

Algorithm 1 shows a randomized version of the standard mirror descent method with an 
unbiased gradient estimate. By assumption, rjk > is deterministic. Note that step 1 of the 
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Algorithm 1 Randomized mirror descent algorithm 



1: Input: A, K C M. d , where K is closed and convex with ^ : A — >• K Legendre, 

step sizes {rjk}, a subroutine, GradSampler, to sample the gradient of J at an arbitrary 
vector 9 > 

2: Initialization: 9^ = &rgmm d( z KnA ^(9), k = 0. 
3: repeat 

4: k = k + l. 

5: Obtain g k = GradSampler^*^ 1 )) 

6: m = arg min eeA { (g k ,9) + (9, 9^)}. 

7-. 6^ = ny >K (§W). 

8: until convergence. 



algorithm is well-defined since 9^ G A° by the assumption that HV^a;)!! tends to infinity 
as x approaches the boundary of A. The performance of Algorithm 1 is bounded in the next 
theorem. The analysis follows the standard proof technique of analyzing the mirror descent 
algorithm (see, e.g., Beck and Teboulle, 2003), however, in a slightly more general form than 
what we have found in the literature. In particular, compared to (Nemirovski et al., 2009a; 
Nesterov, 2010, 2012; Shalev-Shwartz and Tewari, 2011; Richtarik and Takac, 2011), our 
analysis allows for the conditional distribution of the noise in the gradient estimate to be 
history dependent. The proof is included in Section A in the appendix. 

Theorem 3.1. Assume that ^ is a-strongly convex with respect to some norm \\ ■ \\ (with 
dual norm \\ ■ \\*) for some a > 0, that is, for any 9 E A°, 9' £ A 



V(0) > (Vtf(0),0 ; - 9) + fUe' - 9\ 



(4) 



Suppose, furthermore, that Algorithm 1 is run for T time steps. For < k < T — 1 let T k 
denote the a -algebra generated by 9\, . . . , 9 k . Assume that, for all 1 < k < T , g k G M rf is an 
unbiased estimate of VJ(0^ fc_1 - > ) given T^—i, that is, 

nh\^k-i\ = ^j{o i - k - l) ). (5) 

Further, assume that there exists a deterministic constant B > such that for all 1 < k <T, 



E [11^*1 

Finally, assume that 5 = sup g , eKnA ^(l 
k > 1, it holds that 



- <J>(#(°)) is finite. Then, if rjk-i 



(6) 



2aS for all 



E 



J 



Furthermore, if 



k=l 



mf J{e)<\ ~. 

eeh'nA w ~ V aT 



\9k\ 



<B' 



a.s. 



(7) 



(8) 
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for some deterministic constant B' and rjk-i = y Jp^ for all k > 1 then, for any < e < 1, 
it holds with probability at least 1 — e that 




2B'6 B>61og± 
- inf J(0)<J—£ + to -p.. 9 

The convergence rate in the above theorem can be improved if stronger assumptions are 
made on J, for example if J is assumed to be strongly convex, see, for example, (Hazan et al., 
2007; Hazan and Kale, 2011). 

Efficient implementation of Algorithm 1 depends on efficient implementations of steps 
1-1, namely, computing an estimate of the gradient, solving the minimization for 9~( k \ and 
projecting it into K. The first problem is related to the choice of gradient estimate we use, 
which, in turn, depends on the structure of the feature space, while the last two problems 
depend on the choice of the Legendre function. In the next subsections we examine how these 
choices can be made to get a practical variant of the algorithm. 



3.2 Application to multiple kernel learning 

It remains to define the gradient estimates g^ in Algorithm 1. We start by considering 
importance sampling based estimates. First, however, let us first verify whether the gradient 
exist. Along the way, we will also derive some explicit expressions which will help us later. 



Closed-form expressions for the gradient. Let us first consider how w*{6) can be 
calculated for a fixed value of 9. As it will turn out, this calculation will be useful not 
only when the procedure is stopped (to construct the predictor f w *rg) but also during the 
iterations when we will need to calculate the derivative of J with respect to 9% . The following 
proposition summarizes how w* (9) can be obtained. Note that this type of result is standard 
(see, e.g., Shawe- Taylor and Cristianini, 2004; Scholkopf and Smola, 2002), thus we include 
it only for the sake of completeness (the proof is included in Section A in the appendix). 

Proposition 3.2. For 1 < t < n, let l\ : R — > R denote the convex conjugate of it: &ti v ) = 
sup TgK {vt — £t{r)}, v £ M>. For i £ 1, recall that Ki(x,x') = (4>i{x), 4>i(x')), and let K-i = 
(Ki(xt,x s ))i<t lS <n be the n x n kernel matrix underlying Ki and let ICe = Y2ieX be the 

kernel matrix underlying Kg = X^ex^ Ki- Then, for any fixed 9, the minimizer w*{9) of 
J (-,9) satisfies 

9 n 

P i t=l 

where 

a* (9) = argmin J -a T lC e a + - t t (-na t ) \ . (11) 

Based on this proposition, we can compute the predictor f w *^ using the kernels 
and the dual variables (a*(0))i<t< n : f w *(o)(x) = Y^iex ( w i(0)> 4>i{x)) = Ylt=i a t (9)Kg(x t , x) . 



6 



Let us now consider the differentiability of J = J{9) and how to compute its derivatives. 
Under proper conditions with standard calculations (e.g., Rakotomamonjy et al., 2008) we 
find that J is differentiable over A and its derivative can be written as 2 

9 rr a \ fa*(6) T JCia*(9) 



e* m ~ ~ l^r^J« ' (12) 

Importance sampling based estimates. Let d = |X| and let ej, i G X denote the i th unit 
vector of the standard basis of M d , that is, the i th coordinate of is 1 while the others are 
0. Introduce 



9k,. 



(vJ^-^eA, iel (13) 



to denote the i th component of the gradient of J in iteration k (that is, g^i can be computed 
based on (12)). Let Sk-i £ [0, 1] be a distribution over X, computed in some way based on 
the information available up to the end of iteration k — 1 of the algorithm (formally, s/u_i is 
J-fc_i-measurable). Define the importance sampling based gradient estimate to be 

9k,i = 9k,l k , i G X, where I k ~ s fc _! . (14) 

s fe-l,4 

That is, the gradient estimate is obtained by first sampling an index from Sk-i t . and then 
setting the gradient estimate to be zero at all indices i £ X except when i = 1^ in which 
case its value is set to be the ratio 9k '' k . It is easy to see that as long as s^-i^i > holds 

s k-l,I k 

whenever g^i ^ 0, then it holds that E [gu\ J-'k-i] = V J(6>( fc_1 )) a.s. 

Let us now derive the conditions under which the second moment of the gradient estimate 
stays bounded. Define Ck-i = ||VJ(0^' c_1 ^)|| 1 . Given the expression for the gradient of J 
shown in (12), we see that sup fc>1 C^-i < oo will always hold provided that a*(6) is continuous 
since (0( fc-1 ))k>i is guaranteed to belong to a compact set (the continuity of a* is discussed 
in Section B in the appendix). 

Define the probability distribution <7fc_i ■ as follows: qk-n = — \9ki\ ; i € Z. Then 

' ' ^k — 1 ' 

Ik 1 I 

it holds that \\g~k\\l = tt^ — 9ki k \\ e i k \\* = \\ e h II*- Therefore, it also holds 

that E[||^|| 2 |X fc _i] = ^LiEieilgflleill* < ^ max, eX |^||e 4 || 2 . This shows that 
sup fc>1 E [ ll^fcll 2 1 Xfe_i] < oo will hold as long as sup fc>1 maxjgj l^ 11 < oo and sup fc>1 C^-i < 
oo. Note that when Sk-i = (fe-i, the gradient estimate becomes g^ = Ck-i^-{i t =i}- That is, 
in this case we see that in order to be able to calculate g^i, we need to be able to calculate 
Ck-i efficiently. 



Choosing the potential *S>. The efficient sampling of the gradient is not the only practical 
issue, since the choice of the Legendre function and the convex set K may also cause some 
complications. For example, if *$>(x) = X^iex Xi ( ma ' i — -0> then the resulting algorithm 
is exponential weighting, and one needs to store and update |X| weights, which is clearly 
infeasible if |X| is very large (or infinite). On the other hand, if ^(x) = gH^Hf; and we project 

2 For completeness, the calculations are given in Section B in the appendix. 



7 



Algorithm 2 Projected stochastic gradient algorithm. 



1: Initialization: ^(x) = h\\x\\2, &\ = for alH E X, k = 0, step sizes 

2: repeat 

3: k = k + l. 

4: Sample a gradient estimate g\. of g(0^ k ~ l ) randomly according to (14). 
5: ^)=n^ A2 (0( fe - 1 )- % _ 1 5 fc ). 

6: until convergence. 



to K = A2, the positive quadrant of the ^ 2 -ball (with A = [OjOo)" 1 ), we obtain a stochastic 
projected gradient method, shown in Algorithm 2. This is in fact the algorithm that we 
use in the experiments. Note that in (2) this corresponds to using p = 4/3. The reason we 
made this choice is because in this case projection is a simple scaling operation. Had we 
chosen K = Ai, the ^-projection would very often cancel many of the nonzero components, 
resulting in an overall slow progress. Based on the above calculations and Theorem 3.1 we 
obtain the following performance bound for our algorithm. 

Corollary 3.3. Assume that a* (9) is continuous on A2. Then there exists a C > such 
that \\-§gJ(0)\\i < C for all 9 E A 2 . Let B = \C 2 max i£ x,i<k<T fr"^- If Algorithm 2 is run 
for T steps with = r) = 1/V BT, k = 1, . . . , T, then, for all 9 E A2, 



E 



J 



1 T 



1) 



fe=l 



J{9)< 



Note that to implement Algorithm 2 efficiently, one has to be able to sample from Sk-i, 
and compute the importance sampling ratio gk,i/sk,i efficiently for any k and i. 



4 Example: Learning polynomial kernels 

In this section we show how our method can be applied in the context of multiple kernel learn- 
ing. We provide an example when the kernels in I are tensor products of a set of base kernels 
(this we shall call learning polynomial kernels) . The importance of this example follows from 
the observation of Gonen and Alpaydm (2011) that the non-linear kernel learning methods of 
Cortes et al. (2009), which can be viewed as a restricted form of learning polynomial kernels, 
are far the best MKL methods in practice and can significantly outperform state-of-the-art 
SVM with a single kernel or with the uniform combination of kernels. 

Assume that we are given a set of base kernels {«i, . . . , K r }. In this section we consider the 
set Kd of product kernels of degree at most D: Choose X = {(n, . . . , rj) ■ 0<d<D,l<ri<r} 
and the multi-index r 1: ^ = (n,...,^) E X defines the kernel K ri . d (x,x') = Yii=i K n [x, x') . 
For d = we define K ri . (x, x') = 1. Note that indices that are the permutations of each 
other define the same kernel. On the language of statistical modeling, K ri . d models interac- 
tions of order d between the features underlying the base kernels K\, . . . , K r . Also note that 
|X| = @(r D ), that is, the cardinality of X grows exponentially fast in D. 

We assume that p ri . d depends only on d, the order of interactions in K ri . d . By abusing 
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Algorithm 3 Polynomial kernel sampling. The symbol denotes the Hadamard prod- 
uct/power. 

1: Input: a G M n , the solution to the dual problem; kernel matrices {/Ci, . . . , fC r }; the 
degree D of the polynomial kernel, the weights (/jg, . . . , p 2 D ). 

3: 6(d') <- pf (M, S Qd ' ), d! G {0, . . . , D} 

d'=0' 



for i = 1 to d do 

r(M ge< 

tr(M S©(d-i+i)) 



f x tr(MSQ("-O0X: 3 O • n | 



4: Sample d from <5(-)/J^ =0 J(d') 
5 
6 

7: Sample Zj from 7r(- 

8: M <- M®K, Zi 

9: end for 

10: return (zi, . . . , z^) 



notation, we will write pd in the rest of this section to emphasize this. Our proposed 
algorithm to sample from qu-x,- is shown in Algorithm 3. The algorithm is written to return 
a multi-index (z\, . .. ,Zd) that is drawn from qk-i,-- The key idea underlying the algorithm 
is to exploit that QZ£=i K j) d = Yl ri . d eX K n-.<i- The correctness of the algorithm is shown in 
Section 4.1. In the description of the algorithm denotes the matrix entrywise product 
(a.k.a. Schur, or Hadamard product) and A &s denotes A ... A, and we set the priority 

s v y 

s 

of to be higher than that of the ordinary matrix product (by definition, all the entries of 
^ 0O are 1). 

Let us now discuss the complexity of Algorithm 3. For this, first note that computing all 
the Hadamard products S Qd ,d' = 0, . . . , D requires 0(Dn 2 ) computations. Multiplication 
with Mfc_i can be done in 0(n 2 ) steps. Finally, note that each iteration of the for loop takes 
0{rn 2 ) steps, which results in the overall worst-case complexity of 0(rn 2 D) if a*(6k-i) is 
readily available. The computational complexity of determining a*(9k-i) depends on the 
exact form of it, and can be done efficiently in many situations: if, for example, it is the 
squared loss, then a* can be computed in 0(n 3 ) time. An obvious improvement to the 
approach described here, however, would be to subsample the empirical loss L n , which can 
bring further computational improvements. However, the exploration of this is left for future 
work. 

Finally, note that despite the exponential cardinality of \I\, due to the strong algebraic 
structure of the space of kernels, Ck-i can be calculated efficiently. In fact, it is not hard to 
see that with the notation of the algorithm, Ck-i = Y2d'=o^(^')- This also shows that if pd 
decays "fast enough", Ck-i can be bounded independently of the cardinality of I. 

4.1 Correctness of the sampling procedure 

In this section we prove the correctness of Algorithm 3. 

As said earlier, we assume that p ri . , depends only on d, the order of interactions in k Ti , 

3 Using importance sampling, more general weights can also be accommodated, too without effecting the 
results as long as the range of weights (p ri . d ) is kept under control for all d. 
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and, by abusing notation, we will write pd to emphasize this. Let us now consider how one can 
sample from qk-i,-- The implementation relies on the fact that (%2j=i K j) d = Yln. d eT K ri-d- 

Remember that we denoted the kernel matrix underlying some kernel k by /C/%, and recall 
that /Cfc is an n x n matrix. For brevity, in the rest of this section for k = K ri . d we will write 
lC ri . d instead of 1C Kr . Define Mk—i = a*(#fc_i)a!*(#fc_i) T . Thanks to (12) and the rotation 
property of trace, we have 

9k,r 1:d = -/^ 2 tr(M fc _! K ri:d ) . (15) 

The plan to sample from qk-i : - = |<7fc,.|/ d ex l^n-J * s as f° nows: We first draw the order 
of interactions, < d < D. Given d = d, we restrict the draw of the random multi-index 
Ri-,d to the set {r\ : d € X}. A multi-index will be sampled in a ci-step process: in each step 
we will randomly choose an index from the indices of base kernels according to the following 
distributions. Let S = Ki + ... + fC r , let 



¥[d = d\J r i 



fe-i 



/£ 2 tr(M fe _iS® d ) 
and, with a slight abuse of notation, for any 1 < i < d define 



Ri = r i \J : k-i 1 d = d,Ri : i-i = r\-i-\ 

tr (M fc _i (©5=i/C rj .) 

>:;■ 1 tr (A4-1 (©5=1^) JC r , 

where we used the sequence notation (namely, s±- p denotes the sequence (s±, . . . , s p )). We 
have, by the linearity of trace and the definition of S that 

r 

tr (Mjfe_i (o}"!^) JC r > S ^) 

r>=l 

tr ( M fe _x (oj- 1 !^) S ^ +1 )) 



Thus, by telescoping, 



d, Ri-.d = ri; d \F k - 



; 2 tr(M fc _i K n . . . /C^ AC 



as desired. An optimized implementation of drawing these random variables is shown as 
Algorithm 3. The algorithm is written to return the multi-index R\-d- 



5 Experiments 

In this section we apply our method to the problem of multiple kernel learning in regression 
with the squared loss: L(w) = \ Y^t=i{fiv{ x t) ~ Vt) 2 , where (xt,yt) G R r x R are the input- 
output pairs in the data. In these experiments our aim is to learn polynomial kernels (cf. 
Section 4). 
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We compare our method against several kernel learning algorithms from the literature 
on synthetic and real data. In all experiments we report mean squared error over test sets. 
A constant feature is added to act as offset, and the inputs and output are normalized to 
have zero mean and unit variance. Each experiment is performed with 10 runs in which we 
randomly choose training, validation, and test sets. The results are averaged over these runs. 

5.1 Convergence speed 

In this experiment we examine the speed of convergence of our method and compare it against 
one of the fastest standard multiple kernel learning algorithms, that is, the p-norm multiple 
kernel learning algorithm of Kloft et al. (2011) with p = 2, 4 and the uniform coordinate 
descent algorithm that updates one coordinate per iteration uniformly at random (Nesterov, 
2010, 2012; Shalev-Shwartz and Tewari, 2011; Richtarik and Takac, 2011). We aim to learn 
polynomial kernels of up to degree 3 with all algorithms. Our method uses Algorithm 3 
for sampling with D = 3. The set of provided base kernels is the linear kernels built from 
input variables, that is, k^(x,x') = x^x'^, where x^ denotes the i th input variable. For 
the other two algorithms the kernel set consists of product kernels from monomial terms for 
D G {0,1,2,3} built from r base kernels, where r is the number of input variables. The 
number of distinct product kernels is C^/ 3 )- In this experiment for all algorithms we use 
ridge regression with its regularization parameter set to 10~ 5 . Experiments with other values 
of the regularization parameter achieved similar results. 

We compare these methods in four datasets from the UCI machine learning repository 
(Frank and Asuncion, 2010) and the Delve datasets 5 . The specifications of these datasets are 
shown in Table 1. We run all algorithms for a fixed amount of time and measure the value 



Table 1: Specifications of datasets used in experiments. 



Dataset 


# of variables 


Training size 


Validation size 


Test size 


german 


20 


350 


150 


500 


ionosphere 


34 


140 


36 


175 


ringnorm 


20 


500 


1000 


2000 


sonar 


60 


83 


21 


104 


splice 


60 


500 


1000 


1491 


waveform 


21 


500 


1000 


2000 



of the objective function (1), that is, the sum of the empirical loss and the regularization 
term. Figure 1 shows the performance of these algorithms. In this figure Stoch represents 
our algorithms, Kloft represents the algorithm of Kloft et al. (2011), and UCD represents 
the uniform coordinate descent algorithm. The results show that our method consistently 
outperforms the other algorithms in convergence speed. Note that our stochastic method 
updates one kernel coefficient per iteration, while Kloft updates ( r ^°) kernel coefficients 
per iteration. The difference between the two methods is analogous to the difference be- 
tween stochastic gradient vs. full gradient algorithms. While UCD also updates one kernel 

4 Note that p — 2 in Kloft et al. (2011) notation corresponds to p = 4/3 or v — 2 in our notation, which 
gives the same objective function that we minimize with Algorithm 2. 
5 See, www. cs . toronto . edu/~delve/data/datasets .html 
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time (sec.) time (sec.) time (sec.) time (sec.) 

Figure 1: Convergence comparison of our method and other algorithms. 

coefficient per iteration its naive method of selecting coordinates results in a slower over- 
all convergence compared to our algorithm. In the next section we compare our algorithm 
against several representative methods from the MKL literature. 

5.2 Synthetic data 

In this experiment we examine the effect of the size of the kernel space on prediction accuracy 
and training time of MKL algorithms. We generated data for a regression problem. Let r 
denote the number of dimensions of the input space. The inputs are chosen uniformly at 
random from [—1, l] r . The output of each instance is the uniform combination of 10 monomial 
terms of degree 3 or less. These terms are chosen uniformly at random among all possible 
terms. The outputs are noise free. We generated data for r £ {5, 10, 20, . . . , 100}, with 500 
training and 1000 test points. The regularization parameter of the ridge regression algorithm 
was tuned from {10 -8 , . . . , 10 2 } using a separate validation set with 1000 data points. 

We compare our method (Stoch) against the algorithm of Kloft et al. (2011) (Kloft), 
the nonlinear kernel learning method of Cortes et al. (2009) (Cortes), and the hierarchical 
kernel learning algorithm of Bach (2008) (Bach). 6 The set of base kernels consists of r linear 
kernels built from the input variables. Recall that the method of Cortes et al. (2009) only 
considers kernels of the form kq = (X)I=i 6i K i) > where D is a predetermined integer that 
specifies the degree of nonlinear kernel. Note that adding a constant feature is equivalent 
to adding polynomial kernels of degree less than D to the combination too. We provide all 
possible product kernels of degree to D to the kernel learning method of Kloft et al. (2011). 
For our method and the method of Bach (2008) we set the maximum kernel degree to D = 3. 

The results are shown in Figure 2, the mean squared errors are on the left plot, while the 
training times are on the right plot. In the training-time plot the numbers inside brackets 

6 While several fast MKL algorithms are available in the literature, such as those of Sonnenburg et al. (2006); 
Rakotomamonjy et al. (2008); Xu et al. (2010); Orabona and Luo (2011); Kloft et al. (2011), a comparison of 
the reported experimental results shows that from among these algorithms the method of Kloft et al. (2011) 
has the best performance overall. Hence, we decided to compare against only this algorithm. Also note 
that the memory and computational cost of all these methods still scale linearly with the number of kernels, 
making them unsuitable for the case we are most interested in. Furthermore, to keep the focus of the paper 
we compare our algorithm to methods with sound theoretical guarantees. As such, it remains for future work 
to compare with other methods, such as the infinite kernel learning of Gehler and Nowozin (2008), which lack 
such guarantees but exhibit promising performance in practice. 
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number of dimensions of input space [1,771] [12,341] [39,711] [91,881] [176,851] 

Figure 2: Comparison of kernel learning methods in terms of test error (left) and training 
time (right). 

indicate the total number of distinct product kernels for each value of r. This is the number of 
kernels fed to the Kloft algorithm. Since this method deals with a large number of kernels, 
it was possible to precompute and keep the kernels in memory (8GB) for r < 25. Therefore, 
we ran this algorithm for r < 25. For r > 25, we could use on-the-fly implementation of this 
algorithm, however that further increases the training time. Note that the computational cost 
of this method depends linearly on the number of kernels, which in this experiment, is cubic 
in the number of input variables since D = 3. While the standard MKL algorithms, such 
as Kloft, cannot handle such large kernel spaces, in terms of time and space complexity, 
the other three algorithms can efficiently learn kernel combinations. However their predictive 
accuracies are quite different. Note that the performance of the method of Cortes et al. 
(2009) starts to degrade as r increases. This is due to the restricted family of kernels that 
this method considers. The method of Bach (2008), which is well-suited to learn sparse 
combination of product kernels, performs better than Cortes et al. (2009) for higher input 
dimensions. Among all methods, our method performs best in predictive accuracy while its 
computational cost is close to that of the other two competitors. 

5.3 Real data 

In this experiment we aim to compare several MKL methods in real datasets. We compare 
our new algorithm (Stoch), the algorithm of Bach (2008) (Bach), and the algorithm of 
Cortes et al. (2009) (Cortes). For each algorithm we consider learning polynomial kernels 
of degree 2 and 3. We also include uniform combination of product kernels of degree D, 
i.e. kd = (Yli=i K i) D i f° r D G {1,2,3} (Uniform). To find out if considering higher- 
order interaction of input variables results in improved performance we also included a MKL 
algorithm to which we only feed linear kernels (D = 1). We use the MKL algorithm of Kloft 
et al. (2011) with p £ {1, 2} (Kloft). 

We compare these methods on six datasets from the UCI machine learning repository 
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IStoch (d=2) 
I Stoch (d=3) 
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I Bach (d=2) 
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I I Uniform (d=1) 
Uniform (d=2) 
Uniform (d=3) 





splice 



waveform 




Figure 3: Prediction error of different methods in the real data experiment 



and Delve datasets. In these datasets the number of dimensions of the input space is 20 
and above. The specifications of these datasets are shown in Table 1. The regularization 
parameter is selected from the set {10 -4 , . . . , 10 3 } for all methods using a validation set. The 
results are shown in Figure 3. 

Overall, we observe that methods that consider non-linear variable interactions (Stoch, 
Bach, and Cortes) perform better than linear methods (Kloft). Among non-linear meth- 
ods, Cortes performs worse than the other two. We believe that this is due to the restricted 
kernel space considered by this method. The performance of Stoch and Bach methods is 
similar overall. 

We observe that our method overfits when it considers kernels of degree 3. However, one 
can easily prevent overfitting by assigning larger p values to higher-degree kernels such that 
the stochastic algorithm selects lower-degree kernels more often. For this purpose, we repeat 
this experiment for D = 3 with a modified set of p values, where we use p\ = \ for kernels of 
degree 2 or less and p\ = 4 for kernels of degree 3. With the new p coefficients we observe an 
improvement in algorithm's performance. See Stoch (D = 3, prior) error values in Figure 3. 



6 Conclusion 

We introduced a new method for learning a predictor by combining exponentially many linear 
predictors using a randomized mirror descent algorithm. We derived finite-time performance 
bounds that show that the method efficiently optimizes our proposed criterion. Our proposed 
method is a variant of a randomized stochastic coordinate descent algorithm, where the main 
trick is the careful construction of an unbiased randomized estimate of the gradient vector 
that keeps the variance of the method under control, and can be computed efficiently when 
the base kernels have a certain special combinatorial structure. The efficiency of our method 
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was demonstrated for the practically important problem of learning polynomial kernels on a 
variety of synthetic and real datasets comparing to a representative set of algorithms from 
the literature. For this case, our method is able to compute an optimal solution in polynomial 
time as a function of the logarithm of the number of base kernels. To our knowledge, ours is 
the first method for learning kernel combinations that achieve such an exponential reduction 
in complexity while satisfying strong performance guarantees, thus opening up the way to 
apply it to extremely large number of kernels. Furthermore, we believe that our method is 
applicable beyond the case studied in detail in our paper. For example, the method seems 
extendible to the case when infinitely many kernels are combined, such as the case of learning 
a combination of Gaussian kernels. However, the investigation of this important problem 
remains subject to future work. 
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A Proofs 

In this section we present the proofs of Theorem 3.1 and Proposition 3.2. The proof of 
Theorem 3.1 is based on the standard proof of the convergence rate of the proximal point 
algorithm, see, for example, (Beck and Teboulle, 2003), or the proof of Proposition 2.2 of 
Nemirovski et al. (2009b), which carry over the same argument to solve very similar but less 
general problems. We also provide some improvements and simplifications at the end. Before 
giving the actual proof, we need the following standard lemma: 

Lemma A.l (Lemma 2.1 of Nemirovski et al. 2009b). Assume that ^ is a-strongly convex 
with respect to some norm \\ ■ \\ (i.e., (4) holds). Let 9\ 6 K n A° , 9 £ K n A, and g 6 M. d . 
Define 62 = &vgmm.Q, &KnA {(g, 9') + D^(9', #1)}. Then 



{ g ,9 1 -e)<D*{e,e 1 )-Dv{e,e 2 ) + 




We provide an alternate proof that is based on the so-called 3-DIV lemma. The 3-DIV 
lemma (e.g., Lemma 11.1, Cesa-Bianchi and Lugosi, 2006) allows one to express the sum of 
the divergences between the vectors u,v and v,w in terms of the divergence between u and 
w and an additional "error term", where u £ A, v,w £ A°: 

D^(u, v) + D^(v, w) = D^(u, w) + (Vip(w) — Vip(v),u — v) . 

Proof. Note that #2 £ A° due to behavior of \l/ at the boundary of A. Thus, ^ is differentiable 
at #2 and 

ViZ>*(0 2 A)=W(0 2 )-V^(0i), (16) 

where Vi denotes differentiation of Dq> w.r.t. its first variable. Let f(9') = (g, 0')+D^{0\ 61). 
By the optimality property of 92 and since 9 G K n A, we have 

<V/(0 2 ),02-0> < 0. 
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Plugging in the definition of / together with the identity (16) gives 

(0 + V^(0 2 )-V^(0i),0 2 -0)<O. (17) 

Now, by the 3-DIV Lemma, 

Dy(9, 9 2 ) + D 9 (9 2 , 0i) = D 9 (0, 0i) + (V*(0i) - W(0 2 ), 9 - 9 2 ) 

= D*(0, Or) + {g + W(0 2 ) - W(0i), 2 -0) + (g,0- 9 2 ) . 

Hence, by reordering and using the inequality (17) we get 

D 9 (9, 9 2 ) - D*(9, Oi) <{g,9- 8 2 ) - D^{8 2 , 9 X ) 

= (g,9i-9 2 )-Dy(9 2 ,9i) + (g,9-9i) 

<Mk + {gi e-8 1 ) , 

2a 

where in the last line we used Young's inequality 7 and that due to the strong convexity of ty, 
At (02, 01 ) > § || <9 2 — □ 

Theorem 3.1. Assume that \I/ is a-strongly convex with respect to some norm \\ ■ \\ (with 
dual norm \\ ■ for some a > 0, that is, for any 9 6 A°, 9' £ A 

#(0') - V(0) > (W(0), 0' - 0) + § ||0' - 0|| 2 . (4) 

Suppose, furthermore, that Algorithm 1 is run for T time steps. For < k < T — 1 let 
denote the a -algebra generated by 9i, ... ,9k- Assume that, for all 1 < k < T , g^ £ M d is an 
unbiased estimate o/ VJ(0( fc_1 )) given Fh-li that is, 

E[^|J- fc _ 1 ] = VJ(0( fe - 1 )). (5) 

Further, assume that there exists a deterministic constant B > such that for all 1 < k < T , 

^[\\9h\\l\ ?k-i] <B a.s. (6) 

Finally, assume that 5 = snj)0i eKn ^ ^(0') — ^(9^) is finite. Then, if = <J for all 
k > 1, it holds that 



E 

Furthermore, if 



inf J(9)<J 2 ^. (7) 



|ffifc||*<B' a.s. (8) 



/or some deterministic constant B' and r\k-\ = y Jpf f or a H k > 1 then, for any < e < 1, 
it holds with probability at least 1 — e that 



J (?f> <t_1) ) 



M m <J^ + J?^i. (9) 
OeKnA K J ~ V aT V aT K ' 



7 Young's inequality states that for any x,y vectors and a > 0, (x,y) < \\x\\*\\y\\ < | (iifil* ^ a ^y 



|2 1 
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Proof. Introduce the average learning rates rj k = %/X/fc=i Vk-i, k = I, . . . ,T, the averaged 
parameter estimates 

k=l 

and choose some 9* £ K n A. To prove the first part of the theorem, it suffices to show that 
the bound holds for J{W~^) - J(9*). Define g k = VJ (flC*" 1 )). By the convexity of J{9), 
we have 



fc=i 

T T 

4?) /a. aik-X) n*\ , sr^J T ) 



k=l 



Ed (ft,**- 15 - + EC - ^ (fe_1) - r < 18 ) 



Notice that the first term on the right hand side above is the sum of linearized losses appearing 
in the standard analysis of the proximal point algorithm with loss functions g k and learning 

— (T) 

rates and the second sum contains the term that depends on how well g k estimates 

the gradient gk- Thus, in this way, it is separated how the proximal point algorithm and the 
gradient estimate effect the convergence rate of the algorithm. The first sum can be bounded 
by invoking the standard bound for the proximal point algorithm (we will give the very short 
proof for completeness, based on Lemma A.l), while the second sum can be analyzed by 
noticing that, by assumption (5), its elements form an {J^j-adapted martingale-difference 
sequence. 

To bound the first sum, first note that the conditions of Lemma A.l are satisfied for 
Oi = 9^ k ~ l \9 = 9*,g = 4-ii?fc> since 9\ G K n A° (as mentioned beforehand, this follows 
from the behavior of \& at the boundary of A). Further, note that due to the so-called 
projection lemma (i.e., the -D^-projection of the unconstrained optimizer is the same as the 
optimizer of the constrained optimization problem), we can conclude that 9^ = 92, where #2 
is defined in Lemma A.l. Thus, Lemma A.l gives 

m-i (gk, o {k - 1] -<r)< d*{9* , e< k -V) - D^(9*,9^ k ) + . 

Summing the above inequality for k = 1,...,T, the divergence terms cancel each other, 
yielding 



rik-l\\9k\\l 



E£ ( 9k ,e^-8*) < * (d*(8*,9W)-D*(9*,9^) + 

(19) 

Let us now turn to the second sum. We start with developing a bound on the expected 
regret. For any 1 < k < T, by construction fj, T } 1 and 9^ k ~^ are J-fe_i-measurable. This, 
together with (5) gives 

E [4-1 (9k - 9k, 9* - 9^) JjfeJ = r£\ (g k - E [g k \ F k ^} , 9* - 9^) = . (20) 
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Combining this result with (18) and (19) yields 

1 



E 



j[eV))-j(e*) 



< 



< 



ELi »7fc-i 



fe=i 



where we used the tower rule to bring in the bound (6), the nonnegativity of Bregman 
divergences, and Af(0,0 (o) ) < tf(0) - *(0(°)); the latter holds as (V*(0^), - (o) ) > 

since 0(°) minimizes $ on Substituting = r) = \J^§f, k = 1, . . . , T finishes the proof 
of (7). 

To prove the high probability result (9), notice that thanks to (5) (g k — g^, 0* — 

is an {J^j-adapted martingale-difference sequence (cf. (20)). By the strong convexity of ^ 
we have 

_ 0*f < Er(0(*-i)) - *(0*) < <5. 



2 



Furthermore, conditions (5) and (8) imply that ||<7fc||* < -B' a.s., and so by (8) we have 
llSfc ~~ 9k\\* < 2\fB~' a.s. Then by Holder's inequality 

(g k - g k , 0* - 0^)| <\\g h - g k \l \\0* - 0^\\ < 2 ] J^. 

Thus, by the Hoeffding-Azuma inequality (see, e.g., Lemma A. 7, Cesa-Bianchi and Lugosi, 
2006), for any < e < 1 we have, with probability at least 1 — e, 



T 



-D\ < 



k=l 



ELi Vk-i \ 



B'5 

a 



Ei^v ( 22 ) 

\k=l / 



Combining (19) with (8) implies an almost sure upper bound on the first sum on the right 
hand side of (18) as in (21) with B' in place of B. This, together with (22) proves the required 

high probability bound (9) when substituting rjk-i = rj = J Jp^. 

□ 



Proposition 3.2. For 1 < t < n, let £* : E — > R denote the convex conjugate of it: £t(. v ) = 
sup TgK {vt — £t(r)}, v E M. For i £ I, recall that Ki(x,x') = (<pi(x),(j)i(x')), and let /Q = 
(Ki(xt, x s ))i<t,s<n be the n x n kernel matrix underlying k% and let fCg = Eiex be the 

P i 

kernel matrix underlying Kg = E?:ex% K i- Then, for any fixed 0, the minimizer w*(9) of 
</(•, 0) satisfies 



<(0) 



^Ta* t (9)Mx t ), iel : 



t=l 



where 



a 



\ 1 1 - 

: (0) = argmin < -a T fCga H — 7 £U 



-na t ) 



(10) 



(11) 
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Proof. By introducing the variables t = {jt)\<t<n G ^ n an d using the definition of L we can 
write the optimization problem (3) as the constrained optimization problem 



1 1 6- kiH 

In what follows, we call this problem the primal problem. The Lagrangian of this problem is 

where a = (at)i<t< n £ K n is the vector of Lagrange multipliers (or dual variables) associated 
with the n equality constraints. The Lagrange dual function, g (a) = inf^^ C(w,t, a), can be 
readily seen to satisfy 



Now, since the objective function of the primal problem is convex and the primal problem 
involves only affine equality constraints and the primal problem is clearly feasible, by Slater's 
condition (p. 226, Boyd and Vandenberghe, 2004), if a* (9) is the maximizer of g(a) then 

w* (9) = arg min inf C(w, r, a* (9)) 

( 2ll 112 n ^ 

pi\\wi\\i 



arg min E { 1 9 n. 2 ~ E a * ( w ^M x t)) 



•jew 



■ 20 % 
iex I t=i 



The minimum of the last expression is readily seen to be equal to the expression given in (10), 
thus finishing the proof. □ 



B Calculating the derivative of J{9) 

In this section we show that under mild conditions the derivative of J exist and we also give 
explicit forms. These derivations are quite standard and a similar argument can be found in 
the paper by (e.g.) Rakotomamonjy et al. (2008) specialized to the case when it is the hinge 
loss. 

As it is well-known, thanks to the implicit function theorem (e.g., Brown and Page, 
1970, Theorem 7.5.6), provided that J = J(w,8) is such that qqq w J(w, 9) and ^J{w,9) 
are continuous, the gradient of J{9) can be computed by evaluating the partial derivative 
§gJ{w, 9) of J(w, 9) with respect to 9 at (w*(6),0)), that is, d e J(9) = ^ J(w, 9)\ w=w * {e) .Note 
that the derivative is well-defined only if 9 > 0, that is, when no coordinates of 9 is zero, in 
which case 



»,<«•(•>,»> = . (24) 



If 9{ = for some i E X, we define the derivative in a continuous manner as 

to a'") < 25 > 

0'£A,0'>O 
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assuming that the limit exists. From (10) we get, for any i G 1, \\w*(9)\\2 = -^a*(9) T ICia*(9). 
Combining with (24) we obtain 



Now, by (25) and the implicit function theorem, a* (9) is a continuous function of 9 provided 
that the functions (1 < t < n) are twice continuously differentiable. This shows that under 
the conditions listed so far, the limit in (25) exists. In the application we shall be concerned 
with, these conditions can be readily verified. 
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